Abstract

Cardinality estimation algorithms receive a stream of elements whose order might 
be arbitrary, with possible repetitions, and return the number of distinct elements. 
Such algorithms usually seek to minimize the required storage and processing at the 
price of inaccuracy in their output. Real-world applications of these algorithms are 
required to process large volumes of monitored data, making it impractical to collect 
and analyze the entire input stream. In such cases, it is common practice to sample 
and process only a small part of the stream elements. This paper presents and
analyzes a generic algorithm for combining any cardinality estimation algorithm with
a sampling process. We show that the proposed sampling algorithm does not affect 
the estimator’s asymptotic unbiasedness, and we analyze the sampling effect on the 
estimator’s variance. 


1 Introduction 

Consider a very long stream of elements $x_1, x_2, x_3, \ldots$ with repetitions. Finding the number
$n$ of distinct elements is a well-known problem with numerous applications. The elements
might represent IP addresses of packets passing through a router [2111211 [33] , elements in a
large database [29], motifs in a DNA sequence [27], or nodes of RFID/sensor networks [36].
One can easily find the exact value of $n$ by comparing the value of a newly encountered
element, $x_i$, to every (stored) value encountered so far. If the value of $x_i$ has not been seen
before, it is stored as well. After all of the elements are treated, the stored elements are
counted. This simple approach does not scale if storage is limited, or if the computation
performed for each element $x_i$ should be minimized. In these cases, the following cardinality
estimation problem should be solved:

The cardinality estimation problem 

Instance: A stream of elements $x_1, x_2, x_3, \ldots$ with repetitions, and an integer $m$. Let $n$ be
the number of different elements, namely $n = |\{x_1, x_2, x_3, \ldots\}|$, and let these elements
be $\{e_1, e_2, \ldots, e_n\}$.

Objective: Find an estimate $\hat{n}$ of $n$ using only $m$ storage units, where $m \ll n$.

As an application example, $x_1, x_2, x_3, \ldots$ could be IP packets received by a server. Each
packet belongs to one of $n$ IP flows $e_1, e_2, \ldots, e_n$, and the cardinality $n$ represents the number
of active flows. By monitoring the number of distinct flows during every time period, a router
can estimate the network load imposed on the end server and detect anomalies. For example,
it can detect DDoS attacks on the server when the number of flows significantly increases
during a short time interval [mEi.

Several algorithms have been proposed for the cardinality estimation problem [9l HOl UHl
[2SIE21ES], all of which were designed to work on the entire stream, namely, without sampling.
However, real-world applications are required to process large volumes of monitored data,
making it impractical to collect and process the entire stream. For example, this is the case
for IP packets received over a high-speed link, because a 100 Gbps link creates a 1 TB log
file in less than 1.5 minutes. In such cases, only a small part of the stream is sampled and
processed.

In this paper we present and analyze a generic algorithm that adds a sampling process 
into every cardinality estimation procedure. The proposed algorithm consists of two steps: 
(a) cardinality estimation of the sampled stream using any known cardinality estimator; (b) 
estimation of the sampling ratio. We show that the proposed algorithm does not affect the 
original estimator’s asymptotic bias (accuracy), and we analyze the algorithm’s effect on the 
estimator’s variance (precision). 

A naive approach to solving the cardinality estimation problem is to estimate the
cardinality of the sampled stream and view it as an estimation for the cardinality of the whole
(unsampled) stream. However, this approach yields poor results because it ignores the
probability of elements that do not appear in the sample. For example, we simulated a stream
of $n = 10{,}000$ distinct elements whose frequency in the stream follows a uniform distribution
$\sim U(10^2, 10^4)$. We then sampled 0.1% of the stream and used the HyperLogLog algorithm
[19] with $m = 200$ storage units to estimate the cardinality of the sample. We repeated
this test 200 times, each on a different stream of 10,000 distinct elements, and averaged
the results. We found that the mean estimated cardinality is $E[\hat{n}] \approx 9{,}100$, which means
a bias of 9%, and that the relative variance is $\mathrm{Var}[\hat{n}/n] \approx 0.0552$. In contrast, our proposed
algorithm computed a mean estimated cardinality of $E[\hat{n}] \approx 9{,}900$, namely a bias of only
1%, and a relative variance of only $\mathrm{Var}[\hat{n}/n] \approx 0.0118$.
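The size of the undercount in the naive approach can be checked with a short calculation. The sketch below is illustrative only: it assumes that every stream item is kept independently with probability `p` (one possible sampling model, not necessarily the one used in the simulation), and it uses a hypothetical frequency profile with one distinct element per integer frequency between 100 and 10,000.

```python
def expected_seen_fraction(frequencies, p):
    """Expected fraction of distinct elements that appear at least once
    in the sample, assuming each stream item is kept independently
    with probability p (an assumption of this sketch)."""
    # An element with frequency f is missed entirely with probability (1 - p)^f.
    missed = sum((1.0 - p) ** f for f in frequencies)
    return 1.0 - missed / len(frequencies)


# Hypothetical profile: one distinct element for every frequency in [100, 10000].
frequencies = range(100, 10001)
fraction = expected_seen_fraction(frequencies, p=0.001)
print(f"expected share of distinct elements present in the sample: {fraction:.3f}")
```

Under these assumptions roughly 91% of the distinct elements are expected to show up in a 0.1% sample, which is in line with the mean of about 9,100 out of 10,000 reported for the naive approach.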

The rest of this paper is organized as follows. Section 2 discusses previous work. Section
3 presents our first algorithm (Algorithm 1) for combining the sampling process with a
generic cardinality estimation procedure. In addition, this section presents an analysis of
the asymptotic bias and variance of Algorithm 1. Section 4 presents our enhanced algorithm
(Algorithm 2), which uses subsampling in order to reduce the memory cost of Algorithm 1.
This section also presents an analysis of the asymptotic bias and variance of Algorithm 2.
Section 5 presents simulation results that validate our analysis in Sections 3 and 4. Finally,
Section 6 concludes the paper.


2 Related Work 


Several works address the cardinality estimation problem [9l [101 ESI 1211 Ell [33] and propose
statistical algorithms for solving it. These algorithms are efficient because they make only
one pass on the data stream, and because they use a fixed and small amount of storage.
The common approach is to use a random hash function that maps each element $e_j$ into a
low-dimensional data sketch $h(e_j)$, which can be viewed as a random variable. The hash
function guarantees that $h(e_j)$ is identical for all the appearances of $e_j$. Thus, the existence
of duplicates, i.e., multiple appearances of the same element, does not affect the value of the
extreme order statistics. Let $h$ be a hash function and $h(x_i)$ denote the hash value of $x_i$.
Then, an order statistics estimator or a bit pattern estimator can be used to estimate the
value of $n$. An order statistics estimator keeps the smallest (or largest) $m$ hash values. These
values are then used to estimate the cardinality [6l [TU] [28] [SO] [32]. A bit pattern estimator
keeps the highest position of the leftmost (or rightmost) “1” bit in the binary representation
of the hash values in order to estimate the cardinality [9, 19].
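To make the order statistics idea concrete, here is a minimal sketch of one common form of such an estimator: it keeps the $m$ smallest hash values and returns $(m-1)/h_{(m)}$, where $h_{(m)}$ is the $m$-th smallest value after scaling hashes to the unit interval. The function names and the use of SHA-1 as the hash are assumptions made for illustration; the cited works may differ in the hash function and in the exact estimator.

```python
import hashlib


def unit_hash(element):
    """Map an element to a pseudo-uniform value in [0, 1) (SHA-1 is an
    illustrative choice, not prescribed by the text)."""
    digest = hashlib.sha1(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def bottom_m_estimate(stream, m):
    """Estimate the number of distinct elements from the m smallest hashes."""
    smallest = set()
    for element in stream:
        smallest.add(unit_hash(element))   # duplicates hash identically
        if len(smallest) > m:
            smallest.discard(max(smallest))
    if len(smallest) < m:
        return float(len(smallest))        # fewer than m distinct hashes: exact
    return (m - 1) / max(smallest)


stream = [f"flow-{i % 5000}" for i in range(15000)]   # 5,000 distinct elements
estimate = bottom_m_estimate(stream, m=256)
print(f"estimated cardinality: {estimate:.0f}")
```

Because equal elements always produce the same hash value, feeding the stream twice leaves the estimate unchanged, which is the duplicate-insensitivity property described above. The linear scan for the maximum keeps the sketch short; a heap would be the usual choice in practice.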

Real-world applications of cardinality estimation algorithms are required to process large 
volumes of monitored data, making it impractical to collect and analyze the entire input 
stream. In such cases, it is common practice to sample and process only a small part of the 
stream elements. For example, routers use sampling techniques to achieve scalability. The 
industry standard for packet sampling is sFlow [1], short for “sampled flow”. Using a defined
sampling rate N, an average of 1 out of N packets is randomly sampled. The flow samples 
are then sent as sFlow datagrams to a central monitoring server, which analyzes the network 
traffic. 

Although sampling techniques provide greater scalability, they also make it more difficult 
to infer the characteristics of the original stream. One of the first works addressing inference 
from samples is the Good-Turing frequency estimation, a statistical technique for estimating 
the probability of encountering a hitherto unseen element in a stream, given a set of past 
samples. For a recent paper on the Good-Turing technique, see [22] . 

Several other works have addressed the problem of inference from samples. For example, 
the detection of heavy hitters, elements that appear many times in the stream, is studied in 
[5]. The authors propose to keep track of the volume of data that has not been sampled. 
Then, a new element is skipped only when its effect on the estimation will “not be too large.” 
The case where the elements are packets has also been addressed. In such cases, the heavy 
hitters are called elephants. The accuracy of detecting elephant flows is studied in [3l] and 
[35] . The authors use Bayes’ theorem for determining the threshold of sampled packets, 
which indicates whether or not a flow is an elephant in the entire stream. 

Other works have dealt with exploiting protocol-level information of sampled packets in 
order to obtain accurate estimations of the size of flows in the network. For example, in [TS]
the authors present a TCP-specific method whose estimate is based on the TCP SYN flag in
the sampled packets. Another method, which uses TCP sequence numbers, is presented in
[37]. These methods can also be used to estimate the cardinality of the flows in the network,
i.e., the number of active flows. However, both methods are limited to TCP flows. In this
paper we present a generic algorithm that does not make any assumptions regarding the 
type of the input elements. 

Related to the cardinality estimation problem is the problem of finding a uniform sample
of the distinct values in the stream. Such a sample can be used for a variety of database
management applications, such as query optimization, query monitoring, query progress
indication and query execution time prediction la Eli- Additional applications of the
uniform sample pertain to approximate query answering, such as estimating the mean, the
variance, and the quantiles over the distinct values of the query [2|,|3l|26]. Several algorithms
provide a uniform sample of the stream; for example, the authors of [ 2 S] show how to find
such a sample in a single data pass. Several variations of this work are also proposed in
[1111201123]. However, all the discussed approaches require scanning the entire input stream,
which is usually impractical. In this paper we present a generic algorithm that does not
require a full data pass over the input stream.

The above works consider uniform packet sampling, where each packet is sampled with a
fixed probability. Previous works have also dealt with size-dependent flow sampling, where
packets are sampled with different probability, according to their flow size. The first works
on size-dependent flow sampling study the problem of deciding which records in a given set
of flow records should be discarded when storage constraints allow only a small fraction to
be kept [131 EH ES]- The sampling decision in these works is made off-line: a flow is first
received and only then discarded or stored. In [3T], the on-line version of this problem is
studied. In this version, upon receiving a packet, the algorithm needs to determine whether
to keep it. The authors develop a new packet sampling method that samples each packet with
probability $f(s)$, where $f$ is a decreasing function of the estimated size $s$ of the corresponding
flow when the packet is received, and the size of the flow is estimated using a small sketch
that stores the approximate sizes of all flows.


3 Cardinality Estimation with Sampling 

3.1 Preliminaries: Good-Turing Frequency Estimation 

The Good-Turing frequency estimation technique is useful in many language-related tasks 
where one needs to determine the probability that a word will appear in a document. 

Let $X = \{x_1, x_2, x_3, \ldots\}$ be a stream of elements, and let $E$ be the set of all different
elements, $E = \{e_1, e_2, \ldots, e_n\}$, such that $x_i \in E$. Suppose that we want to estimate the
probability $\pi(e_j)$ that a randomly chosen element from $X$ is $e_j$. A naive approach is to
choose a sample $Y = \{y_1, y_2, \ldots, y_l\}$ of $l$ elements from $X$, and then to let $\hat{\pi}(e_j) = \#(e_j)/l$,
where $\#(e_j)$ denotes the number of appearances of $e_j$ in $Y$. However, this approach is
inaccurate, because for each element $e_j$ that does not appear in $Y$ even once (an “unseen
element”), $\#(e_j) = 0$, and therefore $\hat{\pi}(e_j) = 0$.

Let $E_i = \{e_j \mid \#(e_j) = i\}$ be the set of elements that appear $i$ times in the sample $Y$.
Thus, $\sum_i |E_i| \cdot i = l$. The Good-Turing frequency estimation claims that $\hat{P}_i = (i + 1)\frac{|E_{i+1}|}{l}$ is
a consistent estimator for the probability $P_i$ that an element of $X$ appears in the sample $i$
times.

For the special case of $P_0$, we get from Good-Turing that $\hat{P}_0 = |E_1|/l$. In other words,
the hidden mass $P_0$ can be estimated by the relative frequency of the elements that appear
exactly once in the sample $Y$. For example, if 1/10 of the elements in $Y$ appear only once
in $Y$, then approximately 1/10 of the elements in $X$ do not appear in $Y$ at all (i.e., they are
unseen elements).
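The Good-Turing estimates described above can be computed directly from the sample. The following is a minimal sketch; the function name `good_turing` and its `max_i` parameter are hypothetical, and the sample is assumed to be non-empty and small enough to hold in memory.

```python
from collections import Counter


def good_turing(sample, max_i=3):
    """Return [P0_hat, P1_hat, ...], where Pi_hat = (i + 1) * |E_{i+1}| / l."""
    l = len(sample)                                # sample length
    appearances = Counter(sample)                  # #(e_j) for every seen element
    class_sizes = Counter(appearances.values())    # |E_i| for every i
    # A missing class size is reported as 0 by Counter.
    return [(i + 1) * class_sizes[i + 1] / l for i in range(max_i + 1)]


# 'b' and 'd' appear once, 'a' twice, 'c' three times; l = 7.
estimates = good_turing(list("aabcccd"))
print(estimates)
```

For this toy sample, $|E_1| = 2$, $|E_2| = 1$ and $|E_3| = 1$, so the estimates are $2/7$, $2/7$, $3/7$ and $0$ for $i = 0, 1, 2, 3$ respectively.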


3.2 The Proposed Algorithm 

We now show how to use Good-Turing in order to combine a sampling process with a
generic cardinality estimation procedure, referred to as Procedure 1. As before, let $X =
\{x_1, x_2, x_3, \ldots\}$ be the entire stream of elements, and let $Y = \{y_1, y_2, \ldots, y_l\}$ be the sampled
stream. Assume that the sampling rate is $P$, namely, a fraction $P$ of the elements of $X$ is sampled
into $Y$. Let $n$ and $n_s$ be the number of distinct elements in $X$ and $Y$ respectively. The
algorithm receives the sampled stream $Y$ as an input and returns an estimate for $n$. The
algorithm consists of two steps: (a) estimating $n_s$ using Procedure 1 (any procedure, such
as in [HI HSl I2S1E2]); (b) estimating $n/n_s$, the factor by which to multiply the cardinality $n_s$
of the sampled stream in order to estimate the cardinality $n$ of the full stream.

To estimate $n_s$ in step (a), Procedure 1 is invoked using $m$ storage units. To estimate
$n/n_s$ in step (b), we note that $P_0 = (n - n_s)/n$ and thus $1/(1 - P_0) = n/n_s$. Therefore, the
problem of estimating $n/n_s$ is reduced to estimating the probability $P_0$ of unseen elements.
As indicated above, by Good-Turing, $\hat{P}_0 = |E_1|/l$ is a consistent estimator for $P_0$. Thus,
we only need to find the number $|E_1|$ of elements that appear exactly once in the sampled
stream $Y$. To compute the value of $|E_1|$ precisely, one should keep track of all the elements in
$Y$ and ignore each previously encountered element. This is done by Algorithm 1 below using
$O(l)$ storage units. We later show (Algorithm 2 in Section 4) that the number of storage
units can be reduced by estimating the value of $|E_1|/l$.

Algorithm 1

(cardinality estimation with sampling)

(a) Estimate the number $n_s$ of distinct elements in the sample $Y$ by invoking a cardinality
estimation procedure (Procedure 1) on this sample using $m$ storage units.

(b) Determine the ratio $n/n_s$ by computing $\frac{1}{1 - \hat{P}_0}$, where $\hat{P}_0 = |E_1|/l$. The value of $|E_1|$ is
computed precisely and $l$ is known.

(c) Return $\hat{n} = \hat{n}_s \cdot \widehat{n/n_s}$ as an estimator for the cardinality of the entire stream $X$.
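The three steps of Algorithm 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the name `algorithm1` is hypothetical, and by default an exact distinct count stands in for Procedure 1 so that the example stays self-contained. Any cardinality estimator can be passed through the `estimate_distinct` parameter instead.

```python
from collections import Counter


def algorithm1(sample, estimate_distinct=lambda ys: len(set(ys))):
    """Scale the sample's cardinality by 1 / (1 - P0_hat), with P0_hat = |E_1| / l.

    estimate_distinct is a stand-in for Procedure 1 (assumption: exact
    counting by default; a sketch-based estimator would be used in practice).
    """
    l = len(sample)
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)   # |E_1|, step (b)
    p0_hat = singletons / l
    if p0_hat >= 1.0:
        raise ValueError("every sampled element appears once; ratio undefined")
    n_s_hat = estimate_distinct(sample)                       # step (a)
    return n_s_hat / (1.0 - p0_hat)                           # step (c)


# n_s = 4 distinct elements, |E_1| = 2, l = 7  ->  4 / (1 - 2/7) = 5.6
result = algorithm1(list("aabcccd"))
print(result)
```

The guard for $\hat{P}_0 = 1$ is a design choice of this sketch: when every sampled element is a singleton, the sample carries no information about repetition and the ratio $1/(1 - \hat{P}_0)$ is undefined.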

3.3 Analysis of Algorithm 1

In this section we analyze the asymptotic bias and variance of Algorithm 1, assuming that
the HyperLogLog algorithm [19] is used as Procedure 1. This algorithm is the best known
cardinality estimator and it has a relative variance of $\mathrm{Var}[\hat{n}/n] \approx 1.08/m$, where $m$ is the
number of used storage units. Our main result is Theorem 1, where we prove that the
sampling does not affect the estimator’s asymptotic unbiasedness, and we show the effect of
the sampling rate $P$ on the estimator’s variance.

We start with three preliminary lemmas. The first lemma shows how to compute the
probability distribution of a random variable that is a product of two normally distributed
random variables whose covariance is 0:



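Lemma 1 (Product distribution)

Let $X$ and $Y$ be two random variables satisfying $X \to \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \to \mathcal{N}(\mu_y, \sigma_y^2)$, such
that $\mathrm{Cov}[X, Y] = 0$. Then, the product $X \cdot Y$ asymptotically satisfies the following:

$$X \cdot Y \to \mathcal{N}\left(\mu_x \mu_y,\; \mu_y^2 \sigma_x^2 + \mu_x^2 \sigma_y^2\right).$$

A proof is given in [38].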
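As a side note on where the variance expression comes from: under the stronger assumption that $X$ and $Y$ are independent (not merely uncorrelated), the variance of the product has the exact form

$$\mathrm{Var}[X \cdot Y] = \mu_x^2 \sigma_y^2 + \mu_y^2 \sigma_x^2 + \sigma_x^2 \sigma_y^2 .$$

The lemma keeps the first two terms. The third term, $\sigma_x^2 \sigma_y^2$, is of lower order whenever both standard deviations are small relative to the corresponding means, which is the regime of interest for consistent estimators.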

The next lemma, known as the Delta Method, can be used to compute the probability
distribution for a function of an asymptotically normal estimator using the estimator’s
variance:


Lemma 2 (Delta Method)

Let $\theta_m$ be a sequence of random variables satisfying $\sqrt{m}(\theta_m - \theta) \to \mathcal{N}(0, \sigma^2)$, where $\theta$ and $\sigma^2$
are finite valued constants. Then, for every function $g$ for which $g'(\theta)$ exists and $g'(\theta) \neq 0$,
the following holds:

$$\sqrt{m}\left(g(\theta_m) - g(\theta)\right) \to \mathcal{N}\left(0,\; \sigma^2 g'(\theta)^2\right).$$

A proof is given in [38].
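As a worked instance of the Delta Method, consider the function $g(x) = \frac{1}{1-x}$, which turns an estimate of the unseen probability into an estimate of the ratio between the full and sampled cardinalities. Its derivative is

$$g'(x) = \frac{1}{(1-x)^2}, \qquad g'(\theta)^2 = \frac{1}{(1-\theta)^4},$$

so if $\sqrt{m}(\theta_m - \theta) \to \mathcal{N}(0, \sigma^2)$ with $\theta < 1$, then

$$\sqrt{m}\left(\frac{1}{1-\theta_m} - \frac{1}{1-\theta}\right) \to \mathcal{N}\left(0,\; \frac{\sigma^2}{(1-\theta)^4}\right).$$

This is why a fourth power of $(1 - P_0)$ appears in the denominator when the lemma is applied to $1/(1 - \hat{P}_0)$.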

The last lemma states a normal limit law for the estimation of $|E_1|/l$, where $|E_1|$ and $l$
are as described in Section 3.1.

Lemma 3 (Random Sample’s Coverage)

$$\frac{|E_1|}{l} \to \mathcal{N}\left(\frac{|E_1|}{l},\; \frac{1}{l}\left(\frac{|E_1| + 2|E_2|}{l} - \left(\frac{|E_1|}{l}\right)^2\right)\right).$$

A proof is given in [18].

We are now ready to start our analysis. Our first lemma summarizes the distribution of
$\hat{P}_0$.


Lemma 4

$\hat{P}_0 \to \mathcal{N}\left(P_0,\; \frac{1}{l}\left(P_0(1 - P_0) + P_1\right)\right)$, where $l$ is the sample size.

Proof:

For the expectation, the following holds

$$E[\hat{P}_0] = E[|E_1|/l] = |E_1|/l.$$

The first equality is due to the definition of $\hat{P}_0$ in Algorithm 1, and the second is because
$|E_1|$ and $l$ are constants. By Good-Turing we get that $|E_1|/l \to P_0$.
For the variance, the following holds

$$\mathrm{Var}[\hat{P}_0] = \mathrm{Var}[|E_1|/l] = \frac{1}{l}\left(\frac{|E_1| + 2|E_2|}{l} - \left(\frac{|E_1|}{l}\right)^2\right).$$

The first equality is due to the definition of $\hat{P}_0$ in Algorithm 1. The second equality is due
to Lemma 3. Finally, due to Good-Turing we get that

$$\frac{1}{l}\left(\frac{|E_1| + 2|E_2|}{l} - \left(\frac{|E_1|}{l}\right)^2\right) \to \frac{1}{l}\left(P_0(1 - P_0) + P_1\right).$$








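As shown in [12], when sampling is not used, Procedure 1 estimates $n$ with mean value $n$
and variance $\frac{n^2}{m}$, namely, $\hat{n} \to \mathcal{N}\left(n, \frac{n^2}{m}\right)$. The following theorem states the asymptotic bias
and variance of Algorithm 1 for $P < 1$.

Theorem 1

Algorithm 1 estimates $n$ with mean value $n$ and variance $n^2\left(\frac{1}{l} \cdot \frac{P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m}\right)$, namely,

$$\hat{n} \to \mathcal{N}\left(n,\; n^2\left(\frac{1}{l} \cdot \frac{P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m}\right)\right),$$

where $l$ is the sample size, and $m$ is the storage size used for
estimating $n_s$. In addition, $P_0$ and $P_1$ satisfy:

1. $E[P_0] = \frac{1}{n}\sum_{i=1}^{n} e^{-P \cdot f_i}$,

2. $E[P_1] = \frac{P}{n}\sum_{i=1}^{n} f_i \cdot e^{-P \cdot f_i}$,

where $f_i$ is the frequency of element $e_i$ in $X$.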


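Proof:

Applying the Delta Method (Lemma 2) on $1 - \hat{P}_0$ yields that

$$\frac{1}{1 - \hat{P}_0} \to \mathcal{N}\left(\frac{1}{1 - P_0},\; \frac{1}{l} \cdot \frac{P_0(1 - P_0) + P_1}{(1 - P_0)^4}\right). \qquad (1)$$

In addition, as noted above for Procedure 1,

$$\hat{n}_s \to \mathcal{N}\left(n_s,\; \frac{n_s^2}{m}\right). \qquad (2)$$

The covariance of the two estimators satisfies $\mathrm{Cov}\left[\hat{n}_s, \frac{1}{1 - \hat{P}_0}\right] = 0$. In the derivation of
this covariance, the first equality is due to the $\hat{P}_0$ definition. The second equality is due to the law of total
covariance. The third equality is because $\hat{n}_s$ and $\widehat{n/n_s}$ are independent when $n_s$ is known.
The fourth equality is due to the covariance definition. The fifth and sixth equalities are due
to the expectation definition and algebraic manipulations.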



























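Applying the distribution product property (Lemma 1) for Eqs. (1) and (2) yields that:

$$\hat{n} = \frac{\hat{n}_s}{1 - \hat{P}_0} \to \mathcal{N}\left(\frac{n_s}{1 - P_0},\; \frac{n_s^2}{l} \cdot \frac{P_0(1 - P_0) + P_1}{(1 - P_0)^4} + \frac{n_s^2}{m\,(1 - P_0)^2}\right).$$

Finally, substituting $n_s = n \cdot (1 - P_0)$ yields that:

$$\hat{n} \to \mathcal{N}\left(n,\; n^2\left(\frac{1}{l} \cdot \frac{P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m}\right)\right).$$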

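The resulting asymptotic variance depends on both $P_0$ and $P_1$, which are determined
according to the sampling rate $P$ and $f_i$, the frequency of each distinct element in the
stream. Thus, the final part of the proof is to compute their expectation. For $P_0$ we get
that:

$$E[P_0] = \frac{1}{n}\sum_{i=1}^{n}(1 - P)^{f_i} = \frac{1}{n}\sum_{i=1}^{n}\left((1 - P)^{1/P}\right)^{P \cdot f_i} \to \frac{1}{n}\sum_{i=1}^{n}\left(e^{-1}\right)^{P \cdot f_i} = \frac{1}{n}\sum_{i=1}^{n} e^{-P \cdot f_i}.$$

The first equality is due to the expectation and $P_0$ definitions. The second and the last
equalities are due to algebraic manipulations. The third equality is due to the known limit
result where $(1 - x)^{1/x} \to e^{-1}$ when $x \to 0$ (in our case $P \to 0$).

For $P_1$ we get that:

$$E[P_1] = \frac{1}{n}\sum_{i=1}^{n} f_i \cdot P \cdot (1 - P)^{f_i - 1} = \frac{P}{n}\sum_{i=1}^{n} f_i \left((1 - P)^{1/P}\right)^{P \cdot (f_i - 1)} \to \frac{P}{n}\sum_{i=1}^{n} f_i \cdot e^{-P \cdot f_i}.$$

The first equality is due to the expectation and $P_1$ definitions. The second and third equalities
are due to algebraic manipulations and the same known limit result noted above. ■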
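The closed-form expressions of Theorem 1 are easy to evaluate numerically. The sketch below is only an illustration: the function names are hypothetical, the sample length is taken to be its expected value $l = P \cdot \sum_i f_i$ (an assumption of this sketch), and the frequency profile is a made-up one with one distinct element per integer frequency in $[100, 10000]$.

```python
import math


def unseen_and_singleton_mass(frequencies, p):
    """E[P0] and E[P1] according to the limit expressions of Theorem 1."""
    n = len(frequencies)
    p0 = sum(math.exp(-p * f) for f in frequencies) / n
    p1 = p * sum(f * math.exp(-p * f) for f in frequencies) / n
    return p0, p1


def relative_variance_alg1(frequencies, p, m):
    """Relative variance of Algorithm 1: sampling term plus the 1/m term."""
    p0, p1 = unseen_and_singleton_mass(frequencies, p)
    l = p * sum(frequencies)          # expected sample length (assumption)
    sampling_term = (p0 * (1 - p0) + p1) / (l * (1 - p0) ** 2)
    return sampling_term + 1.0 / m


frequencies = list(range(100, 10001))     # hypothetical frequency profile
p0, p1 = unseen_and_singleton_mass(frequencies, p=0.001)
variance = relative_variance_alg1(frequencies, p=0.001, m=100)
print(f"E[P0] = {p0:.4f}, E[P1] = {p1:.4f}, relative variance = {variance:.6f}")
```

For this profile the sampling term is several orders of magnitude smaller than $1/m$, because the sample length $l$ is large; the variance of Algorithm 1 is then dominated by the storage size $m$ of Procedure 1.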


4 Reducing the Computational Cost of Algorithm 1

4.1 Algorithm 2 with Subsampling

Algorithm 1 computes $|E_1|$ precisely. To this end, it uses $O(l)$ storage units, which is linear
in the sample size. We now show how to reduce this cost by approximating the value of $|E_1|$
using a subsample $U$ of the sample $Y$ (see Figure 1).

Algorithm 2

(cardinality estimation with sampling and subsampling)

Same as Algorithm 1, except that in step (b) the ratio $|E_1|/l$ is estimated by invoking
Procedure 2 using only $u \ll l$ storage units.


Procedure 2:

1. Uniformly subsample $u$ elements from the sampled stream $Y$. Let this subsample be $U$.

2. Compute (precisely) the number $|U_1|$ of elements that appear only once in $U$.

3. Return $\hat{P}_0 = |U_1|/u$.

Figure 1: The relationship between X, Y and U

The intuition behind Algorithm 2 is that the cheap operation of Algorithm 1, estimating
$n_s$, is performed on the whole sample $Y$, whose length is $l$, while the expensive operation,
computing the number of elements that appear only once ($|E_1|$), is performed on a small
subsample $U$ of length $u$, where $u \ll l$.

Uniform subsampling (step (1) in Procedure 2) can be implemented using one-pass
reservoir sampling [39], as follows. First, initialize $U$ with the first $u$ elements of $Y$, namely,
$y_1, y_2, \ldots, y_u$, and sort them in decreasing order of their hash values. When a new element
is sampled into $Y$, its hash value is compared to the current maximal hash value of the
elements in $U$. If the hash value of the new element is smaller than the current maximal
hash value of $U$, the new value is stored in $U$ instead of the element with the maximal hash
value. After all of the elements are treated and the sample $Y$ is created, $U$ is a uniform
subsample of length $u$.
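Procedure 2 can be sketched as follows. Note that this sketch swaps in the classical index-based reservoir sampling (often called Algorithm R) for the subsampling step, rather than the hash-comparison variant described above; both are one-pass methods that keep $u$ items. The function names are hypothetical, and the sample is assumed to be non-empty.

```python
import random
from collections import Counter


def reservoir_sample(stream, u, rng=random):
    """Classical Algorithm R: a uniform sample of u items in a single pass."""
    reservoir = []
    for index, item in enumerate(stream):
        if index < u:
            reservoir.append(item)           # fill the reservoir first
        else:
            slot = rng.randint(0, index)     # inclusive on both ends
            if slot < u:
                reservoir[slot] = item       # replace with probability u / (index + 1)
    return reservoir


def procedure2(sample_stream, u, rng=random):
    """Estimate P0 as |U_1| / u from a uniform subsample U of the sample Y."""
    subsample = reservoir_sample(sample_stream, u, rng)
    counts = Counter(subsample)
    singletons = sum(1 for c in counts.values() if c == 1)   # |U_1|
    # Divide by the actual subsample length, which equals u unless Y is shorter.
    return singletons / len(subsample)


p0_hat = procedure2(list("aabcccd"), u=10)
print(p0_hat)
```

When the sample is shorter than $u$, the reservoir simply holds the whole sample and the estimate coincides with $|E_1|/l$ (here $2/7$). Passing a seeded `random.Random` instance as `rng` makes runs reproducible.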

We now analyze the running time complexity of Algorithm 2. Both steps (a) and (b)
are performed using a simple pass over the sample $Y$, and require $O(1)$ operations per
sampled element. Thus, these steps require $O(l)$ operations. Step (b) requires additional
$O(u)$ operations for each insertion of an element into $U$. On the average, there are $O(\log l)$
such insertions. The total complexity is thus $O(l + u \cdot \log l) = O(l)$, which is similar to that
of Algorithm 1. However, the main advantage of Algorithm 2 over Algorithm 1 is that it
requires only $m + u$ storage units, while Algorithm 1 requires $m + l$ storage units, where
$u \ll l$.

Next, we analyze the asymptotic bias and variance of Algorithm 2, assuming that the
HyperLogLog algorithm [19] is used as Procedure 1. Then we generalize the analysis for any
cardinality estimation procedure.






4.2 Analysis of Algorithm 2


Our main result is Theorem 2, which proves that the subsampling does not affect the
asymptotic unbiasedness of the estimator and analyzes the effect of the sampling rate $P$ on the
estimator’s variance, with respect to the storage sizes $m$ and $u$.

Let $Z_i$ be the set of elements that appear exactly $i$ times in the subsample $U$; thus,
$\sum_i |Z_i| \cdot i = u$ and $Z_1$ is the set of elements that appear only once in $U$. $|Z_1|$ can be written
using indicator variables as:

$$|Z_1| = \sum_{j=1}^{u} I_j,$$

where

$$I_j = \begin{cases} 1 & \text{if the } j\text{'th element in } U \text{ has a single appearance in the subsample} \\ 0 & \text{otherwise.} \end{cases}$$

Consider the estimator $|Z_1|/u$ for $|E_1|/l$. By definition, the variable $|Z_1|$ follows a
hypergeometric distribution, which can be relaxed to a binomial distribution if $u \ll l$ [38]. Thus,
due to binomial distribution properties, the expectation is

$$E\left[\,|Z_1|/u \;\middle|\; |E_1|\,\right] = E\left[|Z_1|/u\right] = E[I_j] = |E_1|/l, \qquad (3)$$

and the variance is

$$\mathrm{Var}\left[\,|Z_1|/u \;\middle|\; |E_1|\,\right] = \mathrm{Var}\left[|Z_1|/u\right] = \frac{1}{u}\,\mathrm{Var}[I_j] = \frac{1}{u} \cdot \frac{|E_1|}{l} \cdot \left(1 - \frac{|E_1|}{l}\right). \qquad (4)$$


The following lemma summarizes the distribution of $\hat{P}_0$.

Lemma 5

$\hat{P}_0 \to \mathcal{N}\left(P_0,\; \frac{1}{u}\left(2P_0(1 - P_0) + P_1\right)\right)$.

Proof:

For the expectation, the following holds

$$E[\hat{P}_0] = E\left[|Z_1|/u\right] = E\left[E\left[\,|Z_1|/u \;\middle|\; |E_1|\,\right]\right] = E\left[|E_1|/l\right] = |E_1|/l.$$

The first equality is due to Procedure 2. The second equality is due to the law of total
expectation. The third equality is due to Eq. (3). The fourth equality is due to Lemma 3.
By Good-Turing we get that $|E_1|/l \to P_0$. For the variance, the following holds:

$$\mathrm{Var}[\hat{P}_0] = \mathrm{Var}\left[|Z_1|/u\right] = \mathrm{Var}\left[E\left[\,|Z_1|/u \;\middle|\; |E_1|\,\right]\right] + E\left[\mathrm{Var}\left[\,|Z_1|/u \;\middle|\; |E_1|\,\right]\right]$$

$$= \frac{1}{u}\left(\frac{|E_1| + 2|E_2|}{l} - \left(\frac{|E_1|}{l}\right)^2\right) + \frac{1}{u} \cdot \frac{|E_1|}{l} \cdot \left(1 - \frac{|E_1|}{l}\right)$$

$$= \frac{2}{u}\left(\frac{|E_1| + |E_2|}{l} - \left(\frac{|E_1|}{l}\right)^2\right).$$

The first equality is due to Procedure 2. The second equality is due to the law of total
variance. The third equality is due to Eq. (4) and Lemma 3. The fourth equality is due to
algebraic manipulations.

By Good-Turing we get that $\frac{2}{u}\left(\frac{|E_1| + |E_2|}{l} - \left(\frac{|E_1|}{l}\right)^2\right) \to \frac{1}{u}\left(2P_0(1 - P_0) + P_1\right)$.
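The algebraic step and the final limit in this proof can be spelled out as follows. Writing $a = |E_1|/l$ and $b = |E_2|/l$ as shorthand (these two symbols are introduced here only for readability), the sum of the two variance terms is

$$\frac{1}{u}\left(a + 2b - a^2\right) + \frac{1}{u}\,a(1 - a) = \frac{1}{u}\left(2a + 2b - 2a^2\right) = \frac{2}{u}\left(a + b - a^2\right).$$

By Good-Turing, $a \to P_0$ and $2b \to P_1$, so

$$\frac{2}{u}\left(a + b - a^2\right) \to \frac{2}{u}\left(P_0 + \frac{P_1}{2} - P_0^2\right) = \frac{1}{u}\left(2P_0(1 - P_0) + P_1\right),$$

which is the variance stated in the lemma.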


























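The following theorem states the asymptotic bias and variance of Algorithm 2 for $P < 1$.

Theorem 2

Algorithm 2 estimates $n$ with mean value $n$ and variance $n^2\left(\frac{1}{u} \cdot \frac{2P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m}\right)$, namely,

$$\hat{n} \to \mathcal{N}\left(n,\; n^2\left(\frac{1}{u} \cdot \frac{2P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m}\right)\right).$$

In addition, $P_0$ and $P_1$ can be estimated as described in Theorem 1.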


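Proof:

Applying the Delta Method (see Section 3.3) on $1 - \hat{P}_0$ yields that:

$$\frac{1}{1 - \hat{P}_0} \to \mathcal{N}\left(\frac{1}{1 - P_0},\; \frac{1}{u} \cdot \frac{2P_0(1 - P_0) + P_1}{(1 - P_0)^4}\right). \qquad (5)$$

According to [9]:

$$\hat{n}_s \to \mathcal{N}\left(n_s,\; \frac{n_s^2}{m}\right). \qquad (6)$$

Recall that $\mathrm{Cov}\left[\hat{n}_s, \frac{1}{1 - \hat{P}_0}\right] = 0$ (see Section 3.3); applying the distribution product property
(see Section 3.3) for Eqs. (5) and (6) yields that:

$$\hat{n} = \frac{\hat{n}_s}{1 - \hat{P}_0} \to \mathcal{N}\left(\frac{n_s}{1 - P_0},\; \frac{n_s^2}{u} \cdot \frac{2P_0(1 - P_0) + P_1}{(1 - P_0)^4} + \frac{n_s^2}{m\,(1 - P_0)^2}\right).$$

Finally, substituting $n_s = n \cdot (1 - P_0)$ yields that:

$$\hat{n} \to \mathcal{N}\left(n,\; n^2\left(\frac{1}{u} \cdot \frac{2P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m}\right)\right).$$

The resulting asymptotic variance depends on both $P_0$ and $P_1$, which are determined
according to the sampling rate $P$ and $f_i$, the frequency of each distinct element in the
stream, as was described in Section 3.3. ■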

The analysis above assumes that the HyperLogLog algorithm [19] is used as Procedure
1. Recall that the asymptotic relative efficiency (ARE) of cardinality estimator $\hat{n}$ is defined
as the ratio $\mathrm{ARE} = \frac{1/m}{\mathrm{Var}[\hat{n}/n]}$. For example, the ARE of bottom-m sketches [28] is 1.00,
and the ARE of the maximal-term sketch in [9] is 0.93. The following theorem generalizes
Theorem 2 for any cardinality estimation procedure.


Theorem 3

Algorithm 2 estimates $n$ with mean value $n$ and variance $n^2\left(\frac{1}{u} \cdot \frac{2P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m \cdot \mathrm{ARE}}\right)$, namely,

$$\hat{n} \to \mathcal{N}\left(n,\; n^2\left(\frac{1}{u} \cdot \frac{2P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m \cdot \mathrm{ARE}}\right)\right),$$

where ARE is the asymptotic relative efficiency of
Procedure 1. In addition, $P_0$ and $P_1$ can be estimated as described in Theorem 2.


The proof is identical to that of Theorem 2.
























5 Simulation Results 


In this section we validate our analysis for the asymptotic bias and variance of Algorithm
1 and Algorithm 2, as stated in Theorems 1 and 2 respectively. We implement both
algorithms using the HyperLogLog [19] as Procedure 1, and simulate a stream of $n$ distinct
elements. Each distinct element $e_j$ appears $f_j$ times in the original (unsampled) stream.
These frequencies are determined according to the following models:

1. Uniform distribution: The frequency of the elements is uniformly distributed between
100 and 10,000; i.e., $f_j \sim U(10^2, 10^4)$.

2. Pareto distribution: The frequency of the elements follows the heavy-tailed rule with
shape parameter $\alpha$ and scale parameter $s = 500$; i.e., the frequency probability function
is $p(f_j) = \frac{\alpha s^{\alpha}}{f_j^{\alpha + 1}}$, where $\alpha > 0$ and $f_j \geq s > 0$. The scale parameter $s$ represents
the smallest possible frequency.

Pareto distribution has several unique properties. In particular, if $\alpha \leq 2$, it has infinite
variance, and if $\alpha \leq 1$, it has infinite mean. As $\alpha$ decreases, a larger portion of the probability
mass is in the tail of the distribution, and it is therefore useful when a small percentage of
the population controls the majority of the measured quantity.
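Frequencies following the Pareto model can be drawn by inverting its cumulative distribution function $F(x) = 1 - (s/x)^{\alpha}$. The sketch below is illustrative: the function name is hypothetical, and it returns real-valued frequencies, so any rounding to integer counts is left to the caller (the text does not specify how this was handled in the simulations).

```python
import random


def pareto_frequency(alpha, scale, rng=random):
    """Draw one value from a Pareto(alpha, scale) law by inverse-CDF sampling."""
    uniform = 1.0 - rng.random()              # lies in (0, 1], never zero
    return scale / uniform ** (1.0 / alpha)   # always >= scale


rng = random.Random(0)                        # seeded for reproducibility
frequencies = [pareto_frequency(alpha=1.1, scale=500, rng=rng) for _ in range(10000)]
print(f"smallest drawn frequency: {min(frequencies):.1f}")
print(f"largest drawn frequency:  {max(frequencies):.1f}")
```

With a shape parameter as small as 1.1, a handful of draws are typically orders of magnitude larger than the scale parameter, which illustrates the heavy tail described above.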

Table 1 presents the simulation results for Algorithm 1 using uniformly distributed
frequencies. The number of distinct elements is $n = 10{,}000$. Thus, the expected length of
the original stream $X$ is $10{,}000 \cdot \frac{10^2 + 10^4}{2} = 50.5 \cdot 10^6$. We examine two sampling rates:
$P = 1/100$ (Table 1(a)) and $P = 1/1000$ (Table 1(b)). We use different $m$ values, and for
every $m$ average the results over 200 different runs. In each table row we present, for every
$m$, the bias and the variance. The bias column is only from the simulations and it is always
very close to 0, as proven in our analysis. For the variance we have two values: one from the
analysis (Theorem 1) and one from the simulations.

The results in Table 1 show very good agreement between the simulation results and our
analysis. First, as already said, the bias values are all very close to 0. Second, the simulation
variance is always very close to the analyzed variance.


(a) P = 1/100

  m     bias     variance (analysis)   variance (simulation)
  50    0.0141   0.0209                0.0174
  100   0.0094   0.0114                0.0099
  150   0.0036   0.0096                0.0087

(b) P = 1/1000

  m     bias     variance (analysis)   variance (simulation)
  50    0.0023   0.0200                0.0191
  100   0.0134   0.0100                0.0116
  150   0.0094   0.0067                0.0057

Table 1: Simulation results for Algorithm 1 using uniformly distributed frequencies

Next, we consider Algorithm 2 and seek to validate Theorem 2. Table 2 presents the
simulation results for uniform distribution of the frequencies. The total storage budget is
200 units, which are partitioned between $m$ and $u$. The number of distinct elements is
$n = 10{,}000$. We examine again two sampling rates: $P = 1/100$ and $P = 1/1000$. Table
3 presents results for the Pareto distribution of the frequencies, with $\alpha = 1.1$, $n = 10{,}000$,
$P = 1/100$, and a total storage budget of 2,000 units. The results are averaged again over
200 runs, and the variance from the analysis is determined according to Theorem 2.


(a) P = 1/100

  m     u     bias     variance (analysis)   variance (simulation)
  10    190   0.0093   0.1000                0.1081
  50    150   0.0184   0.0200                0.0199
  100   100   0.0114   0.0101                0.0118
  150   50    0.0060   0.0068                0.0059
  190   10    0.0142   0.0058                0.0053

(b) P = 1/1000

  m     u     bias     variance (analysis)   variance (simulation)
  10    190   0.0439   0.1000                0.1149
  50    150   0.0025   0.0200                0.0217
  100   100   0.0029   0.0101                0.0121
  150   50    0.0037   0.0068                0.0075
  190   10    0.0058   0.0060                0.0054

Table 2: Simulation results for Algorithm 2 using uniform distribution and m + u = 200
storage units

In both tables we see again that the bias is indeed practically 0 and that the variance of 
the algorithm as found by the simulations is very close to the variance found by our analysis. 
These results are very consistent, for both frequency distributions, both sampling rates, and 
all m and u values. As expected, when m + u increases (more storage is used), the variance 
decreases. 


  m      u      bias      variance (analysis)   variance (simulation)
  50     1950   0.00005   0.0200                0.0217
  100    1900   0.0189    0.0100                0.0104
  500    1500   0.0011    0.0020                0.0023
  1000   1000   0.00001   0.0010                0.0009
  1500   500    0.0107    0.0007                0.0006

Table 3: Simulation results for Algorithm 2 using Pareto distribution and m + u = 2000
storage units

We now want to compare the performance of Algorithms 1 and 2. Recall that Algorithm
2 is expected to have a higher variance, but with significantly less storage. In Theorems 1
and 2 we got the following closed expressions for the relative variance of the algorithms:

1. Algorithm 1: $\frac{1}{l} \cdot \frac{P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m}$.

2. Algorithm 2: $\frac{1}{u} \cdot \frac{2P_0(1 - P_0) + P_1}{(1 - P_0)^2} + \frac{1}{m}$.

Recall that $m + l$ is the total storage used by Algorithm 1 ($l$ is the sample length), and
$m + u$ is the total storage used by Algorithm 2. The probabilities $P_0$ and $P_1$ are determined
according to the sampling rate $P$ and the frequency distribution of the distinct elements in
the stream (see Theorem 1). Therefore, in a given stream, the only parameters that need to
be determined by the user are $m$ in Algorithm 1 and $m$ and $u$ in Algorithm 2. In order to
find the values of $m$ and $u$ that yield the minimal variance for a given input stream, one only
needs to know the sampling rate and then minimize the relative variance function stated
above.
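The minimization over $m$ and $u$ can be done by exhaustive search, since the storage budget is a small integer. The sketch below evaluates the closed-form relative variance of Algorithm 2 for every split of a budget; the function names and the values chosen for $P_0$ and $P_1$ are hypothetical and serve only as an illustration. The split it returns minimizes the analytical expression and need not coincide with a split found by simulation.

```python
def algorithm2_relative_variance(p0, p1, m, u):
    """Closed-form relative variance of Algorithm 2 for storage sizes m and u."""
    sampling_term = (2 * p0 * (1 - p0) + p1) / (u * (1 - p0) ** 2)
    return sampling_term + 1.0 / m


def best_partition(p0, p1, budget):
    """Try every split m + u = budget and return (m, u, variance) of the best."""
    best = None
    for m in range(1, budget):
        u = budget - m
        variance = algorithm2_relative_variance(p0, p1, m, u)
        if best is None or variance < best[2]:
            best = (m, u, variance)
    return best


# Hypothetical values of P0 and P1, chosen only for illustration.
m, u, variance = best_partition(p0=0.09, p1=0.10, budget=100)
print(f"m = {m}, u = {u}, relative variance = {variance:.4f}")
```

Setting the derivative of $A/u + 1/m$ to zero under the constraint $m + u = B$, with $A = \frac{2P_0(1 - P_0) + P_1}{(1 - P_0)^2}$, gives $u/m = \sqrt{A}$, so the search result can be cross-checked against $m \approx B / (1 + \sqrt{A})$.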

Table 4 presents the simulation results for $n = 10{,}000$, a uniform distribution of element
frequencies, and for several sampling rates. Table 4(a) presents the variance of Algorithm 1.
In each table row we present the sample length $l$, the value of $m$, the total storage used by
the algorithm ($m + l$), and the simulation variance (averaged over 200 different runs). Recall
that in addition to $m$, Algorithm 1 uses $O(l)$ storage units for the exact computation of $|E_1|$.
Table 4(b) presents the minimal variance of Algorithm 2 as a function of $B$, where $B$ indicates
the total number of storage units we are willing to spend. In each table row we present the
optimal partition of $B$ between $m$ and $u$ that minimizes the variance of the estimator, and
the simulation variance for these $m$ and $u$ values. For the case where $P = 1$ (no sampling),
we provide in both tables the simulation variance of HyperLogLog [12], which we use as
Procedure 1. This algorithm is the best known cardinality estimator and it has a relative
variance of $\mathrm{Var}[\hat{n}/n] \approx 1.08/m$ [19]. In this case we do not provide the values of $l$, $m$ and $u$
as there is no meaning to these parameters because sampling is not used.


(a) Algorithm 1

  P        m      l         total storage (m + l)   variance (simulation)
  1/100    100    505,000   505,100                 0.0116
  1/100    500    505,000   505,500                 0.0018
  1/100    1000   505,000   506,000                 0.0009
  1/500    100    101,000   101,100                 0.0095
  1/500    500    101,000   101,500                 0.0021
  1/500    1000   101,000   102,000                 0.0008
  1/1000   100    50,500    50,600                  0.0099
  1/1000   500    50,500    51,000                  0.0019
  1/1000   1000   50,500    51,500                  0.0008
  1        100    -         100                     0.0101
  1        500    -         500                     0.0021
  1        1000   -         1000                    0.0010

(b) Algorithm 2

  P        B      m     u     variance (simulation)
  1/100    100    92    8     0.0112
  1/100    500    460   40    0.0022
  1/100    1000   921   79    0.0009
  1/500    100    80    20    0.0126
  1/500    500    401   99    0.0027
  1/500    1000   803   197   0.0011
  1/1000   100    72    28    0.0152
  1/1000   500    363   137   0.0031
  1/1000   1000   724   276   0.0013
  1        100    -     -     0.0101
  1        500    -     -     0.0021
  1        1000   -     -     0.0010

Table 4: Simulation results for Algorithms 1 and 2 using uniform distribution


We can easily see from the tables that the storage-variance trade-off of Algorithm 2 is
significantly better than that of Algorithm 1. For example, the same variance (0.011) is
obtained by both algorithms in the first row of $P = 1/100$. However, in this row Algorithm
1 uses 505,100 storage units whereas Algorithm 2 uses only 100. For $P = 1/500$, we see that
the same variance (0.002) is obtained by the two algorithms when Algorithm 1 uses 101,500
storage units while Algorithm 2 uses only 500.


6 Conclusions 

In this paper we studied the problem of estimating the number of distinct elements in a
stream when only a small sample of the stream is given. We presented Algorithm 1, which
combines a sampling process with a generic cardinality estimation procedure. The proposed
algorithm consists of two steps: (a) cardinality estimation of the sampled stream using
any known cardinality estimator; (b) estimation of the sampling ratio using Good-Turing
frequency estimation. Then we presented an enhanced algorithm that uses subsampling in order to
reduce the memory cost of Algorithm 1. We proved that both algorithms do not affect the
asymptotic unbiasedness of the original estimator. We also analyzed the sampling effect
on the asymptotic variance of the estimators. Finally, we presented simulation results that
validate our analysis and showed how to find the optimal parameter values that yield the
minimal variance.